Add Tesseract training setup scripts and example data #339

Draft

Penguin2600 wants to merge 2 commits into TwoAbove:develop from Penguin2600:add_tesstrain

Penguin2600 commented Mar 4, 2024

Work in progress, opening for visibility.

Current status:

tessTrain/tessTrain.sh - Works. It sets up a baseline Ubuntu 22.04 environment (WSL, container, etc.) with the tools and binaries required for training Tesseract, then runs an example training session with the included example training data. Documentation and sources are in comments inside the script; look there for further details for now.

tessTrain/example_truth/ - Example of what a training data directory needs to look like. Used by tessTrain.sh to confirm that setup was successful.
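
As a rough illustration of what "needs to look like" could mean in practice, here is a minimal sketch of a pairing check. It assumes the tesstrain convention that each image (for example `97984949.png`) sits next to a transcription named `97984949.gt.txt`; the function name `find_unpaired_images` is hypothetical and is not part of this PR.

```shell
# Report every .png in a directory that lacks a matching .gt.txt transcription.
# Assumes the tesstrain naming convention: <name>.png next to <name>.gt.txt.
# Returns non-zero if any image is unpaired.
find_unpaired_images() {
  dir="$1"
  missing=0
  for img in "$dir"/*.png; do
    [ -e "$img" ] || continue          # the glob matched nothing
    gt="${img%.png}.gt.txt"
    if [ ! -f "$gt" ]; then
      echo "missing ground truth for: $img"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `find_unpaired_images tessTrain/example_truth` before a training run would flag images whose text file was forgotten or misnamed.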

Ping me on discord for any questions or comments! Thx.


          Add Tesseract training setup scripts and example data

TwoAbove reviewed

Owner

TwoAbove left a comment

Looks good!

Left a couple of minor comments.

I also have a question about dataScripts/tessTrain/example_truth/97984949.png and similar ones. Would the extra M hinder the training in any way?

dataScripts/tessTrain/tessTrain.sh

Comment on lines +25 to +27

+              sudo apt-get install libicu-dev libpango1.0-dev libcairo2-dev
+              sudo apt-get install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
+              sudo apt-get install libpango1.0-dev libleptonica-dev

Owner

TwoAbove Mar 5, 2024

I think it would make sense to extract this into the README as a ## training tesseract section.

I would split this script into two parts - a setup.sh script (also mention it in the README in the setup instructions) and a train.sh script that takes in a ground truth path.

Author

Penguin2600 Mar 5, 2024

Totally, I had the same thought. One will likely run once, while the other may need many runs.
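
A possible shape for the train.sh half of that split, as a sketch only. It assumes training is driven through the tesstrain Makefile (whose `MODEL_NAME`, `START_MODEL` and `GROUND_TRUTH_DIR` variables are real); the function name `build_train_command` and the default model name `noita` are hypothetical placeholders. The function only prints the command so it can be inspected before anything runs.

```shell
# train.sh sketch: validate the ground-truth path and print the training command.
build_train_command() {
  gt_dir="${1:-}"
  model="${2:-noita}"   # hypothetical default model name
  if [ -z "$gt_dir" ]; then
    echo "usage: train.sh <ground-truth-dir> [model-name]" >&2
    return 2
  fi
  if [ ! -d "$gt_dir" ]; then
    echo "error: not a directory: $gt_dir" >&2
    return 1
  fi
  # MODEL_NAME, START_MODEL and GROUND_TRUTH_DIR are tesstrain Makefile variables.
  echo "make training MODEL_NAME=$model START_MODEL=eng GROUND_TRUTH_DIR=$gt_dir"
}
```

Keeping the one-time package installation in setup.sh means this part can be rerun against different ground-truth directories without touching the system.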

dataScripts/tessTrain/tessTrain.sh

+              greentext "Installing Deps and Creating File Structure"
+              # Dont polute the directory
+              mkdir -p ./tess

Owner

TwoAbove Mar 5, 2024

Since this script creates artifacts, we'll need to add them to a .gitignore file. Ideally, we would keep the sole .gitignore so it's consolidated in one place.

Author

Penguin2600 Mar 5, 2024 (edited)

Good callout, I'll consider what the new entries might need to be.
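
One way the consolidated .gitignore could be kept up to date, sketched under assumptions: the helper name `add_ignore_entry` is hypothetical, and the entry `dataScripts/tessTrain/tess/` assumes the script is run from dataScripts/tessTrain so that `mkdir -p ./tess` lands there. The actual list of artifacts is still to be determined.

```shell
# Append an entry to a .gitignore file only if that exact line is not already present.
add_ignore_entry() {
  ignore_file="$1"
  entry="$2"
  touch "$ignore_file"
  # -x matches whole lines, -F treats the entry as a fixed string, -q stays silent.
  if ! grep -qxF -- "$entry" "$ignore_file"; then
    printf '%s\n' "$entry" >> "$ignore_file"
  fi
}
```

For example, `add_ignore_entry .gitignore "dataScripts/tessTrain/tess/"` run from the repository root is safe to repeat, since a second call leaves the file unchanged.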

dataScripts/tessTrain/tessTrain.sh

Comment on lines +37 to +39

+              sudo apt-get install libicu-dev
+              sudo apt-get install libpango1.0-dev
+              sudo apt-get install libcairo2-dev

Owner

TwoAbove Mar 5, 2024

These too

dataScripts/tessTrain/tessTrain.sh

+              greentext "Pulling the required ENG traineddata from github"
+              wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
+              sudo mv eng.traineddata /usr/local/share/tessdata

Owner

TwoAbove Mar 5, 2024

Is there a way to not populate paths outside of noitool? It would be great if this could be confined to this directory. It looks like you can use the TESSDATA variable to make it local. https://github.com/tesseract-ocr/tesstrain?tab=readme-ov-file#train

Author

Penguin2600 Mar 5, 2024

Yes, I think that's a great idea, will incorporate.
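
A sketch of how the download could stay inside the working directory. The directory layout `tess/tessdata` and both function names are hypothetical; `TESSDATA` is the tesstrain Makefile variable mentioned above, and `TESSDATA_PREFIX` is the environment variable Tesseract itself reads to locate traineddata files.

```shell
# Create (if needed) and print the absolute path of a tessdata directory
# inside the given root, instead of using /usr/local/share/tessdata.
local_tessdata_dir() {
  root="${1:-.}"
  mkdir -p "$root/tess/tessdata"
  (cd "$root/tess/tessdata" && pwd)
}

# Download eng.traineddata into the local directory; no sudo required.
fetch_eng_traineddata() {
  dest="$(local_tessdata_dir "${1:-.}")"
  wget -O "$dest/eng.traineddata" \
    https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
}
```

After `fetch_eng_traineddata .`, something like `export TESSDATA_PREFIX="$(local_tessdata_dir .)"` for Tesseract and `TESSDATA="$TESSDATA_PREFIX"` on the `make training` command line should keep everything under the repository.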


          Update dataScripts/tessTrain/tessTrain.sh

073ff03

Co-authored-by: Seva Maltsev <[email protected]>

Author

Penguin2600 commented Mar 5, 2024

I also have a question about dataScripts/tessTrain/example_truth/97984949.png and similar ones. Would the extra M hinder the training in any way?

Short Answer:
I don't know.

Long answer:
My best understanding is that if the box generation step tags that partial M as a glyph, then training may try to label it. That will almost certainly fail, since the M is not present in the corresponding gt.txt file. The mismatch will increase the resulting training error, which may, in a somewhat artificial way, cause that sample to be effectively ignored during training. Ideally we should give it perfectly clean images and perfectly clean matching ground truth text. In this case, depending on the training execution variables, I see 90-98% success rates on validation data from the minimal example training set.

TwoAbove mentioned this pull request

BUG - Live game helper not capturing the correct seed number #373

Open
